The Penn Treebank: Annotating Predicate Argument Structure
Abstract
The Penn Treebank has recently implemented a new syntactic annotation scheme, designed to highlight aspects of predicate-argument structure. This paper discusses the implementation of crucial aspects of this new annotation scheme. The scheme incorporates a more consistent treatment of a wide range of grammatical phenomena; provides a set of coindexed null elements in what can be thought of as "underlying" position for phenomena such as wh-movement, passive, and the subjects of infinitival constructions; provides a non-context-free annotational mechanism that allows the structure of discontinuous constituents to be easily recovered; and allows for a clear, concise tagging system for some semantic roles.

1. INTRODUCTION

During the first phase of the Penn Treebank project [10], ending in December 1992, 4.5 million words of text were tagged for part of speech, with about two-thirds of this material also annotated with a skeletal syntactic bracketing. All of this material has been hand-corrected after processing by automatic tools. The largest component of the corpus consists of materials from the Dow-Jones News Service; over 1.6 million words of this material have been hand-parsed, with an additional 1 million words tagged for part of speech. Also included is a skeletally parsed version of the Brown corpus, the classic million-word balanced corpus of American English [5, 6], hand-retagged using the Penn Treebank tagset. The level of syntactic analysis annotated during this phase of the project was an extended and somewhat modified form of the skeletal analysis produced by the treebanking effort in Lancaster, England [7].
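The coindexed null elements mentioned in the abstract can be illustrated with a small sketch. The bracketed sentence, the `*T*-1` trace notation, and the function name `coindexed_fillers` are assumptions chosen for illustration and are not taken from this excerpt; the point is only that a numeric index shared by an overt constituent and a null element lets the "underlying" position be recovered mechanically.

```python
import re

# Hypothetical example in a Treebank-style bracketing (notation assumed for illustration).
# WHNP-1 is the overt wh-phrase; *T*-1 marks its "underlying" object position.
EXAMPLE = "(SBARQ (WHNP-1 What) (SQ did (NP-SBJ John) (VP eat (NP *T*-1))))"


def coindexed_fillers(text):
    """Map each numeric index to its overt constituent label and its null elements."""
    links = {}
    # Indexed labels such as WHNP-1 directly follow an opening bracket.
    for label, index in re.findall(r"\(([A-Z]+(?:-[A-Z]+)*)-(\d+)\b", text):
        links.setdefault(index, {"filler": None, "nulls": []})["filler"] = label
    # Null elements such as *T*-1 or *-2 start with an asterisk.
    for null, index in re.findall(r"(\*[A-Z]*\*?)-(\d+)", text):
        links.setdefault(index, {"filler": None, "nulls": []})["nulls"].append(null)
    return links


print(coindexed_fillers(EXAMPLE))
```

In this toy sentence, index 1 links the fronted `WHNP` to the null element in object position, which is the kind of information a predicate-argument extractor would need.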
The released materials in the current Penn Treebank, although still in very preliminary form, have been widely distributed: directly by us, on the ACL/DCI CD-ROM, and now on CD-ROM by the Linguistic Data Consortium. They have been used for purposes ranging from serving as a gold standard for parser testing, to serving as a basis for the induction of stochastic grammars, to serving as a basis for quick lexicon induction.

Many users of the Penn Treebank now want forms of annotation richer than those provided by the project's first phase, as well as an increase in the consistency of the preliminary corpus. Some would also like a less skeletal form of annotation, expanding the essentially context-free analysis of the current treebank to indicate non-contiguous structures and dependencies. Most crucially, there is a strong sense that the Treebank could be of much more use if it explicitly provided some form of predicate-argument structure. The desired level of representation would make explicit at least the logical subject and logical object of the verb, and indicate, at least in clear cases, how subconstituents are semantically related to their predicates. Such a representation could serve both as a starting point for the kinds of SEMEVAL representations now being discussed as a basis for evaluation of human language technology within the ARPA HLT program, and as a basis for "glass box" evaluation of parsing technology.

The ongoing effort [1] to develop a standard objective methodology to compare parser outputs across widely divergent grammatical frameworks has now resulted in a widely supported standard for parser comparison. On the other hand, many existing parsers cannot be evaluated by this metric because they directly produce a level of representation closer to predicate-argument structure than to classical surface grammatical analysis.
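To make the idea of comparing parser output against a gold-standard bracketing concrete, here is a minimal sketch of labeled-bracket precision and recall. It is an assumption that this resembles the metric cited as [1]; the excerpt does not define that metric, and the function names and toy sentences below are hypothetical.

```python
from collections import Counter


def parse_brackets(text):
    """Parse a string like '(S (NP John) (VP sleeps))' into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # consume the closing bracket
            return [label] + children
        word = tokens[pos]
        pos += 1
        return word

    return read()


def labeled_spans(tree):
    """Return a multiset of (label, start, end) spans over word positions."""
    spans = Counter()

    def walk(node, start):
        if isinstance(node, str):  # a word occupies one position
            return start + 1
        end = start
        for child in node[1:]:
            end = walk(child, end)
        spans[(node[0], start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def bracket_scores(gold_text, test_text):
    """Labeled-bracket precision and recall of test_text against gold_text."""
    gold = labeled_spans(parse_brackets(gold_text))
    test = labeled_spans(parse_brackets(test_text))
    matched = sum((gold & test).values())  # multiset intersection
    return matched / sum(test.values()), matched / sum(gold.values())


gold = "(S (NP John) (VP (V saw) (NP (D the) (N dog))))"
test = "(S (NP John) (VP (V saw) (NP (D the)) (NP (N dog))))"
precision, recall = bracket_scores(gold, test)
print(precision, recall)
```

A score of this kind only rewards matching surface constituents, which is why a parser that outputs predicate-argument structure directly has nothing to compare under it.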
Hand-in-hand with this limitation of the existing Penn Treebank for parser testing is a parallel limitation for automatic methods of training parsers that are based on deeper representations. There is also a problem of maintaining consistency with the fairly small (less than 100 pages) style book used in the first phase of the project.

2. A NEW ANNOTATION SCHEME

We have recently completed a detailed stylebook for this new level of analysis, with consensus across annotators about the particulars of the analysis. This project has taken about eight months of ten-hour-a-week effort across a significant subset of all the personnel of the Penn Treebank. Such a stylebook, much larger and much more fully specified than our initial stylebook, is a prerequisite for high levels of inter-annotator agreement. It is our hope that such a stylebook will also alleviate much of the need for extensive cross-talk between annotators during the annotation task, thereby increasing throughput as well. To ensure that the rules of this new stylebook remain in force, we are now giving annotators about 10% overlapped material to evaluate inter-annotator consistency throughout this new project. We have now begun to annotate this level of structure by editing the present Penn Treebank; we intend to automatically extract a bank of predicate-argument structures, intended at the very least for parser evaluation, from the resulting annotated corpus. The remainder of this paper will discuss the implementation of each of four crucial aspects of the new annotation scheme,
Similar resources
Annotating the Propositions in the Penn Chinese Treebank
In this paper, we describe an approach to annotate the propositions in the Penn Chinese Treebank. We describe how diathesis alternation patterns can be used to make coarse sense distinctions for Chinese verbs as a necessary step in annotating the predicate-structure of Chinese verbs. We then discuss the representation scheme we use to label the semantic arguments and adjuncts of the predicates....
Linking Flat Predicate Argument Structures
This report presents an approach to enriching flat and robust predicate argument structures with more fine-grained semantic information, extracted from underspecified semantic representations and encoded in Minimal Recursion Semantics (MRS). Such representations are provided by a hand-built HPSG grammar with a wide linguistic coverage. A specific semantic representation, called linked predicate...
Automatic Annotation of the Penn-Treebank with LFG F-Structure Information
Lexical-Functional Grammar f-structures are abstract syntactic representations approximating basic predicate-argument structure. Treebanks annotated with f-structure information are required as training resources for stochastic versions of unification and constraint-based grammars and for the automatic extraction of such resources. In a number of papers (Frank, 2000; Sadler, van Genabith and Wa...
Covering Treebanks With GLARF
This paper introduces GLARF, a framework for predicate argument structure. We report on converting the Penn Treebank II into GLARF by automatic methods that achieved about 90% precision/recall on test sentences from the Penn Treebank. Plans for a corpus of hand-corrected output, extensions of GLARF to Japanese and applications for MT are also discussed.
Annotating Predicate-Argument Structure for a Parallel Treebank
We report on a recently initiated project which aims at building a multi-layered parallel treebank of English and German. Particular attention is devoted to a dedicated predicate-argument layer which is used for aligning translationally equivalent sentences of the two languages. We describe both our conceptual decisions and aspects of their technical realisation. We discuss some select...
Automatic Predicate Argument Structure Analysis of the Penn Chinese Treebank
Recent work in machine translation and information extraction has demonstrated the utility of a level that represents the predicate-argument structure. It would be especially useful for machine translation to have two such Proposition Banks, one for each language under consideration. A Proposition Bank for English has been developed over the last few years, and we describe here our development ...
Publication date: 1994